Add Presidio text anonymization scaffold #233
karthiksathishjeemain
left a comment
Thanks for the PR, Suryansh. I like the approach of introducing two different steps for removing PHI (pseudonymize and redact). However, there is a security bug, which I have noted in the comment below. Please fix it; once that is done, I will approve the PR.
from typing import Any, Dict, Iterable, Iterator, List, Tuple

SUPPORTED_PROFILES = {"strict", "research"}
DEFAULT_STUDY_SALT = "bio-block-week2-development-salt"
Please keep the DEFAULT_STUDY_SALT in a secrets file such as .env rather than in the source code. Even though SHA-256 is irreversible, an attacker who knows the salt can still match a hash by generating every possible one: patient ID is a low-entropy field (it usually ranges from 1 to 10^6), and hashing that many candidates is cheap.
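To illustrate the concern, here is a minimal sketch of the attack and of one possible fix. It assumes the surrogate is the SHA-256 of the salt joined to the ID with a colon, which may not match the PR's actual construction; the function names and the `BIO_BLOCK_STUDY_SALT` variable name are hypothetical.

```python
import hashlib
import os
from typing import Optional


def surrogate(patient_id: str, salt: str) -> str:
    """Salted SHA-256 digest of an identifier (assumed construction)."""
    return hashlib.sha256(f"{salt}:{patient_id}".encode("utf-8")).hexdigest()


def brute_force(target_digest: str, salt: str, max_id: int = 10**6) -> Optional[str]:
    """Recover a numeric ID by hashing every candidate with the known salt."""
    for candidate in range(1, max_id + 1):
        if surrogate(str(candidate), salt) == target_digest:
            return str(candidate)
    return None


def load_study_salt(env_var: str = "BIO_BLOCK_STUDY_SALT") -> str:
    """Read the salt from the environment; refuse to fall back to a default."""
    salt = os.environ.get(env_var)
    if not salt:
        raise RuntimeError(f"{env_var} is not set")
    return salt


# With the salt committed to the repository, the search space is tiny.
leaked_salt = "bio-block-week2-development-salt"
digest = surrogate("4242", leaked_salt)
recovered = brute_force(digest, leaked_salt, max_id=10_000)
print(recovered)  # 4242
```

Keeping the salt secret does not raise the entropy of the IDs themselves, but it removes the attacker's ability to run this loop offline.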
I have implemented the changes, @karthiksathishjeemain. Please have another look and let me know your feedback. Thank you so much :)
karthiksathishjeemain
left a comment
Thanks for the changes @XxSURYANSHxX. @pradeeban please merge this PR. Good work!!
Merged. Thanks.
@pradeeban, @karthiksathishjeemain, @Chali-healthy. This PR starts the Week 2 work for my GSoC project:
Bio-Block: Advanced PHI Anonymization and Hybrid Data Retrieval Pipeline
The main goal of this PR is to replace the Week 1 text placeholder handler with a real clinical text anonymization pipeline.
In Week 1, the ingestion endpoint could detect file modality and route files to the correct handler, but all handlers were still placeholders. This PR makes only the text handler real.
After this PR:
This keeps the PR small, focused, and limited to the Week 2 text anonymization scope.
Why This PR Is Needed
The Week 2 proposal scope is:
Before this PR, the text ingestion flow only selected a placeholder handler. It did not actually anonymize clinical text.
This PR adds the first real anonymization path for uploaded text files while avoiding unrelated backend changes.
Main Changes
1. Added a Clinical Text Anonymization Service
A new service was added:
This service exposes a focused function:
The function:
Example response shape:
{
  "anonymization_status": "completed",
  "anonymized_text": "Patient has MRN_A1B2C3D4 and email <REDACTED_EMAIL>.",
  "detected_entities": {
    "MEDICAL_RECORD_NUMBER": 1,
    "EMAIL_ADDRESS": 1
  }
}
Presidio Integration
This PR uses Microsoft Presidio for the text anonymization pipeline.
The new service uses Presidio's AnalyzerEngine with custom PatternRecognizer instances.
One important implementation detail is that the new Week 2 text service avoids hidden runtime model downloads.
Presidio's default setup can try to load or download a spaCy model if none is available. To avoid that, this PR uses a blank local spaCy tokenizer for pattern-based recognition. This keeps the new text ingestion path deterministic and safe for local development and CI.
No spaCy model download is required for this new Week 2 text ingestion path.
Custom Clinical Recognizers Added
This PR adds custom recognizers for clinical identifiers that general PII recognizers may miss.
The added clinical entity types are:
MEDICAL_RECORD_NUMBER
PATIENT_ID
HEALTH_PLAN_ID
ACCESSION_NUMBER
DEVICE_ID
The recognizers are designed to be conservative.
For example, this should be detected:
But this should not be blindly treated as an MRN:
This is important because clinical notes often contain many numbers that are not identifiers.
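The idea of a conservative, context-gated recognizer can be sketched with the standard library's `re` module instead of Presidio's PatternRecognizer. This is only an illustration: the pattern, the 5 to 10 digit length range, the `find_mrns` helper, and the sample sentences are assumptions, not the PR's actual rules.

```python
import re
from typing import List

# A number counts as an MRN only when a clinical context keyword precedes it.
# The keyword list follows the MRN contexts named in this description; the
# digit-length range is a hypothetical choice.
MRN_PATTERN = re.compile(
    r"\b(?:MRN|medical record(?: number)?|hospital number|chart number)"
    r"\s*[:#]?\s*(?P<value>\d{5,10})\b",
    re.IGNORECASE,
)


def find_mrns(text: str) -> List[str]:
    """Return the identifier values that appear next to an MRN context keyword."""
    return [match.group("value") for match in MRN_PATTERN.finditer(text)]


print(find_mrns("MRN: 123456"))                       # ['123456']
print(find_mrns("mrn 123456"))                        # ['123456']
print(find_mrns("Platelet count was 250000 today."))  # []
```

Requiring the keyword is what keeps lab values, doses, and other bare numbers from being flagged.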
Medical Record Number Recognition
The MRN recognizer supports clinical context such as:
MRN
medical record
medical record number
hospital number
chart number
Supported examples include:
The recognizer is case-insensitive, so both of these are supported:
Patient ID Recognition
The Patient ID recognizer supports context such as:
patient id
patient number
patient identifier
pt id
Example:
The raw ID is replaced with a deterministic surrogate like:
Health Plan / Insurance ID Recognition
The health plan recognizer supports context such as:
health plan
beneficiary
insurance
policy
member id
subscriber id
Example:
The raw value is replaced with a deterministic surrogate like:
Additional Clinical Recognizers
This PR also adds recognizers for:
Accession Number
Example contexts:
accession
accession number
acc no
accession no
Replacement format:
Device ID / Serial Number
Example contexts:
device
device id
serial
serial number
implant
equipment
Replacement format:
Deterministic Surrogate Generation
This PR adds deterministic surrogate generation using salted SHA-256 hashes.
The behavior is:
Examples:
becomes something like:
becomes something like:
becomes something like:
The implementation uses SHA-256 from the Python standard library.
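A minimal sketch of deterministic surrogate generation, assuming the `PREFIX_XXXXXXXX` shape seen in the example response (`MRN_A1B2C3D4`). The `make_surrogate` name, the way the salt, prefix, and value are joined, and the 8-character truncation are assumptions for illustration.

```python
import hashlib


def make_surrogate(prefix: str, raw_value: str, study_salt: str, length: int = 8) -> str:
    """Map a raw identifier to a stable surrogate such as MRN_1A2B3C4D.

    The same (prefix, value, salt) triple always yields the same surrogate,
    so records for one patient stay linkable within a study.
    """
    material = f"{study_salt}:{prefix}:{raw_value.strip()}".encode("utf-8")
    digest = hashlib.sha256(material).hexdigest()
    # Keep only a short uppercase slice of the digest (hypothetical length).
    return f"{prefix}_{digest[:length].upper()}"


first = make_surrogate("MRN", "123456", study_salt="example-salt")
second = make_surrogate("MRN", "123456", study_salt="example-salt")
print(first == second)  # True
```

Truncating the digest shortens the surrogate at the cost of a small collision probability, so the length is a trade-off rather than a fixed rule.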
Common PHI Handling
This PR also handles several common PHI patterns.
Email Addresses
Email addresses are replaced with:
Example:
becomes:
Phone Numbers
Phone numbers are replaced with:
Example:
becomes:
SSNs
SSN-like values are replaced with:
Dates
Common date patterns are currently replaced with:
Date shifting is not implemented in this first Week 2 slice. This PR keeps date handling simple and safe by redacting common date formats for now.
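The redaction side can be sketched with plain regular expressions. Only `<REDACTED_EMAIL>` appears in this description's example response; the other placeholder strings, the patterns, and the `redact` helper are assumptions and are deliberately simpler than production-grade recognizers.

```python
import re

# (entity type, pattern, placeholder). Order matters: SSNs and dates are
# replaced before the looser phone pattern runs.
REDACTION_RULES = [
    ("EMAIL_ADDRESS", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "<REDACTED_EMAIL>"),
    ("US_SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<REDACTED_SSN>"),
    ("DATE", re.compile(r"\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4})\b"), "<REDACTED_DATE>"),
    (
        "PHONE_NUMBER",
        re.compile(r"(?<!\w)(?:\+?1[ .-]?)?\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
        "<REDACTED_PHONE>",
    ),
]


def redact(text: str) -> str:
    """Replace each matched span with a fixed placeholder for its entity type."""
    for _entity_type, pattern, placeholder in REDACTION_RULES:
        text = pattern.sub(placeholder, text)
    return text


sample = "Contact john.doe@example.com or 555-123-4567 on 2024-01-15. SSN 123-45-6789."
print(redact(sample))
# Contact <REDACTED_EMAIL> or <REDACTED_PHONE> on <REDACTED_DATE>. SSN <REDACTED_SSN>.
```

Unlike the surrogates used for clinical identifiers, these placeholders carry no link back to the original value.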
Ingestion Endpoint Integration
The Week 1 ingestion endpoint is:
This PR updates the ingestion flow so that text files now go through the real anonymization service.
For text uploads, the endpoint now:
anonymization_status: "completed".
For non-text uploads, the endpoint still returns the Week 1 placeholder response.
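The split between the real text path and the placeholder paths could look roughly like the sketch below. The handler registry, the function names, and the `"placeholder"` status string are hypothetical; only the modality names and the `"completed"` status come from this description.

```python
from typing import Any, Callable, Dict

Handler = Callable[[bytes], Dict[str, Any]]


def make_text_handler(anonymize: Callable[[str], Dict[str, Any]]) -> Handler:
    """Wrap an anonymization function so it can serve as the text handler."""
    def handle(payload: bytes) -> Dict[str, Any]:
        result = anonymize(payload.decode("utf-8"))
        return {"modality": "text", **result}
    return handle


def make_placeholder_handler(modality: str) -> Handler:
    """Handler for modalities whose anonymization is not implemented yet."""
    def handle(payload: bytes) -> Dict[str, Any]:
        return {"modality": modality, "anonymization_status": "placeholder"}
    return handle


def build_handlers(anonymize: Callable[[str], Dict[str, Any]]) -> Dict[str, Handler]:
    """Only the text modality gets a real handler; the rest stay placeholders."""
    handlers = {m: make_placeholder_handler(m) for m in ("csv", "dicom", "nifti", "wsi")}
    handlers["text"] = make_text_handler(anonymize)
    return handlers
```

Passing the anonymizer in as a callable keeps the routing code independent of the service and makes it easy to test with a stub.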
Response Safety
This PR is careful about not exposing raw PHI.
The API response may include:
The API response does not include:
The entity summary only includes entity types and counts.
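One way to build such a summary is to count by entity type and drop the matched values immediately. In this sketch the `summarize_entities` name and the `(entity_type, raw_value)` input shape are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize_entities(detections: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count detections per entity type; raw values never reach the result."""
    return dict(Counter(entity_type for entity_type, _raw_value in detections))


summary = summarize_entities(
    [
        ("MEDICAL_RECORD_NUMBER", "123456"),
        ("EMAIL_ADDRESS", "john.doe@example.com"),
    ]
)
print(summary)  # {'MEDICAL_RECORD_NUMBER': 1, 'EMAIL_ADDRESS': 1}
```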
Safe example:
{ "MEDICAL_RECORD_NUMBER": 1, "EMAIL_ADDRESS": 1 }
Unsafe example not used by this PR:
{ "detected_values": ["123456", "john.doe@example.com"] }
Text Upload Size Guard
This PR adds a text upload size guard:
If a text upload is larger than the limit, the endpoint returns a clear error.
This avoids accidentally reading very large text files into memory during this first implementation slice.
Streaming or chunked anonymization is not implemented yet.
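A size guard of this kind can be sketched by reading at most one byte more than the limit, so an oversized upload is rejected without loading it fully. The 1,000,000-byte limit, the exception class, and the `read_limited` helper are hypothetical; the PR's actual limit is not shown here.

```python
from typing import BinaryIO

MAX_TEXT_UPLOAD_BYTES = 1_000_000  # hypothetical limit for illustration


class TextUploadTooLarge(ValueError):
    """Raised when a text upload exceeds the configured size limit."""


def read_limited(stream: BinaryIO, limit: int = MAX_TEXT_UPLOAD_BYTES) -> bytes:
    """Read the upload, refusing anything larger than `limit` bytes."""
    # Reading limit + 1 bytes is enough to tell whether the limit was exceeded.
    data = stream.read(limit + 1)
    if len(data) > limit:
        raise TextUploadTooLarge(f"Text upload exceeds the {limit}-byte limit")
    return data
```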
UTF-8 Handling
Text files are decoded as UTF-8.
If the uploaded text file is not valid UTF-8, the endpoint returns a clear error instead of failing with an internal exception.
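The decoding step might be handled along these lines, converting the low-level decode failure into an error the endpoint can report cleanly. The exception class, message text, and `decode_utf8` name are assumptions, not the PR's actual wording.

```python
class InvalidTextEncoding(ValueError):
    """Raised when an uploaded text file is not valid UTF-8."""


def decode_utf8(data: bytes) -> str:
    """Decode upload bytes strictly, mapping failures to a client-facing error."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Do not echo the offending bytes back; they may contain PHI.
        raise InvalidTextEncoding("Uploaded text file is not valid UTF-8") from exc
```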
Example error:
What Was Intentionally Not Changed
This PR does not implement any non-text anonymization work.
Not included:
This PR also does not touch:
/store/store_enhanced
The goal was to keep this PR focused only on Week 2 text anonymization.
Files Changed
Added
Updated
Tests Added
A new service-level test file was added:
The ingestion test file was also updated:
The tests cover:
/api/v1/ingest
.txt extension
Test Results
I ran the focused Week 2 tests.
Text anonymization service tests
Command:
Result:
Ingestion endpoint tests
Command:
Result:
There were a few existing dependency warnings during the ingestion tests, but there were no test failures.
Dependency Notes
No new dependencies were added in this PR.
The required packages were already present in:
Already present:
No spaCy model download is required for the new Week 2 text ingestion path.
Current Behavior by Modality
Text
Text now uses real anonymization.
Status:
CSV
CSV still uses the placeholder handler.
Status:
DICOM
DICOM still uses the placeholder handler.
Status:
NIfTI
NIfTI still uses the placeholder handler.
Status:
WSI
WSI still uses the placeholder handler.
Status:
Current Limitations
This is the first Week 2 implementation slice, so a few things are intentionally left for later.
Current limitations:
study_salt still needs mentor confirmation.
Privacy Notes
This PR uses only synthetic test examples.
Examples used in tests include fake values such as:
No real PHI was added.
The implementation does not print raw uploaded text or raw detected values.
The entity summary is safe because it only contains counts by entity type.
Please let me know your feedback. If any changes are required, I will push more commits as I improve this further. Thank you so much :)